Empirical Data on Criterion-Referenced Tests



Although criterion-referenced measurement is not a brand new
invention, its recent marriage with individualized instruction and
instructional technology has attracted some interest in the area of
measurement. Unfortunately, not many people can agree on exactly what
criterion-referenced measurement is. Therefore, I do not expect you to
agree entirely with my version of criterion-referenced tests. In this
paper I will describe some of our experiences in the analysis of
criterion-referenced test items for the Individually Prescribed
Instruction (IPI) program at the University of Pittsburgh.

The version of criterion-referenced tests used in this discussion is
structured as shown in the sample in Table 1. Let us assume there are
four behavioral objectives (or classes of behavior). The number of items
and the mastery levels are identical for each objective in this example,
but this is a coincidence. In actual practice, the number of items per
objective and the mastery level for each objective may vary according to
the nature of the objective. A mastery level is defined as the cut-off
score which is used to declare a student a master or non-master for each
criterion behavior. It is not unusual for a test to consist of only a
single objective, especially if the test is going to be used in
instruction. However, if more than one objective is included in a test,
items for each objective can be grouped together as a subtest. Test
scores referred to in this paper are subtest scores, that is, a separate
score for each objective.
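
The scoring rule described above can be sketched in Python. This is a
minimal illustration, not the IPI scoring procedure itself: the function
name `subtest_mastery`, the response data, and the convention that a
score equal to the cut-off counts as mastery are all assumptions made for
the example.

```python
from typing import Dict, List


def subtest_mastery(
    responses: Dict[str, List[int]],
    mastery_levels: Dict[str, float],
) -> Dict[str, bool]:
    """Classify one student as master (True) or non-master (False) per objective.

    responses: objective label -> item scores (1 = right, 0 = wrong).
    mastery_levels: objective label -> cut-off as a proportion correct.
    Assumption: a subtest score equal to the cut-off counts as mastery.
    """
    decisions = {}
    for objective, items in responses.items():
        proportion_correct = sum(items) / len(items)
        decisions[objective] = proportion_correct >= mastery_levels[objective]
    return decisions


# Hypothetical student on a test with four objectives, five items each,
# and an 80% mastery level for every objective (the layout of Table 1).
student = {
    "I": [1, 1, 1, 1, 1],
    "II": [1, 1, 1, 1, 0],
    "III": [1, 0, 1, 1, 0],
    "IV": [0, 0, 1, 0, 0],
}
levels = {objective: 0.80 for objective in student}
decisions = subtest_mastery(student, levels)
print(decisions)
```

Each objective gets its own decision, so a student can be a master of
objectives I and II while still needing instruction on III and IV.
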

Item Selection Procedures

Since the criterion-referenced test is used to distinguish mastery or
non-mastery of certain criterion behaviors, rather than to differentiate
individuals in a group, several new item discrimination indices have been
proposed. Cox and Vargas (1966) computed the percentage of students who
passed an item on the posttest minus the percentage of those who passed
the item on the pretest. Popham (1970) used chi-square to contrast the
pre- and post-instruction relation of each item with hypothetical
frequencies based on the median value of each subtest. Rahmlow, Matthews,
and Jung (1970) suggested the combined use of the difficulty level and
the instruction gain scores in analyzing criterion-referenced test items.
Although these procedures are different, they have one thing in common:
the use of instruction as a basis of discrimination.
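
As an illustration of the simplest of these indices, the sketch below
computes the posttest-minus-pretest percentage described for Cox and
Vargas (1966). The function name and the response data are hypothetical,
and the sketch follows only the verbal description given here, not the
original authors' computations.

```python
from typing import List


def pre_post_difference(pretest: List[int], posttest: List[int]) -> float:
    """Percentage passing the item on the posttest minus the percentage
    passing it on the pretest (1 = passed, 0 = failed)."""
    pct_pre = 100.0 * sum(pretest) / len(pretest)
    pct_post = 100.0 * sum(posttest) / len(posttest)
    return pct_post - pct_pre


# Hypothetical responses of ten students to one item.
pre = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]   # 20% passed before instruction
post = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 80% passed after instruction
index = pre_post_difference(pre, post)
print(index)
```

A large positive value suggests the item is sensitive to instruction; a
value near zero or below suggests it is not, or that instruction failed.
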

The difficulty of using instruction as a basis of discrimination is that
instruction is not necessarily equal to learning. Poor instruction may
have negative effects on item statistics. In terms of time and money, a
tryout procedure that requires instruction as a necessary component is
not very economical. If the same test is used as both pre- and posttest,
there is the question of whether the student learned just the specific
items on the test or the general class of behaviors which the items
sample.

So far, there are no adequate statistical indices one may use to select
items for criterion-referenced tests. I will not attempt to present any
item selection data using new indices proposed for criterion-referenced
tests. Instead, I will reexamine the meaning of criterion-referenced
tests and see how classical item selection procedures may be applied. If
the test is going to be used to provide "explicit information as to what
the individual can or cannot do" (Glaser, 1963, p. 520), then a good
criterion-referenced test item does not only discriminate pre- and
post-"learning." It is also the function of the item to allow the
individual to answer correctly if he masters the criterion behavior
represented by the item and to answer incorrectly if he actually does not
master it, regardless of whether the test is administered before or after
formal instruction.

Empirically, the person who masters a criterion behavior is the one who
was declared to have mastery on the test. Therefore, in the tryout of
items, we may use the predetermined cut-off score for each behavior to
declare a mastery group and a non-mastery group. Then, for each item, we
may obtain the proportion of subjects who answered correctly in the
mastery group and the proportion of subjects who answered correctly in
the non-mastery group. The difference of these two proportions ($D_p$)
is a meaningful discrimination index for criterion-referenced test items.
An item which has a larger proportion of correct responses in the
non-mastery group certainly is not a good representation of its
corresponding behavior.
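
The computation of $D_p$ described above can be sketched as follows. The
function name, the data, and the rule that a subtest score equal to the
cut-off counts as mastery are assumptions of the example; the sketch also
declines to define $D_p$ when either group is empty, which is a choice of
the sketch rather than something stated in the paper.

```python
from typing import List


def difference_in_proportions(
    item_scores: List[int],
    subtest_scores: List[int],
    cutoff: int,
) -> float:
    """D_p: proportion correct among masters minus proportion correct
    among non-masters, for a single item.

    item_scores: 1 = right, 0 = wrong, one entry per subject.
    subtest_scores: each subject's score on the objective's subtest.
    Assumption: subtest score >= cutoff declares mastery.
    """
    masters = [x for x, s in zip(item_scores, subtest_scores) if s >= cutoff]
    non_masters = [x for x, s in zip(item_scores, subtest_scores) if s < cutoff]
    if not masters or not non_masters:
        raise ValueError("D_p needs at least one master and one non-master")
    return sum(masters) / len(masters) - sum(non_masters) / len(non_masters)


# Hypothetical data: eight subjects, a five-item subtest, cut-off of 4.
item = [1, 1, 1, 0, 1, 0, 0, 0]
subtest = [5, 4, 5, 4, 2, 3, 1, 0]
d_p = difference_in_proportions(item, subtest, cutoff=4)
print(d_p)
```

Here three of four masters and one of four non-masters answered the item
correctly, so the index is .75 minus .25. A negative value would flag an
item that non-masters pass more often than masters.
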

Another way to compute this type of discrimination is to use the phi
($\phi$) coefficient. By pairing the right (1) or wrong (0) response with
the mastery (1) or non-mastery (0) of the subject, the $\phi$ coefficient
can be obtained easily. Although the coefficient lacks invariance
properties, it is probably a better index than the tetrachoric
correlation, since the tetrachoric correlation is very difficult to
compute and requires the assumption of a bivariate normal distribution.
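
A minimal sketch of the $\phi$ computation from the fourfold table of
response (right or wrong) by status (mastery or non-mastery) is given
below. The function name and data are hypothetical; the formula is the
standard one for a 2 x 2 table, and the sketch returns `None` whenever a
row or column total is zero, because the denominator then vanishes.

```python
import math
from typing import List, Optional


def phi_coefficient(item_scores: List[int], mastery: List[int]) -> Optional[float]:
    """Phi between item response (1 = right, 0 = wrong) and mastery
    status (1 = master, 0 = non-master)."""
    a = b = c = d = 0
    for x, m in zip(item_scores, mastery):
        if m == 1 and x == 1:
            a += 1  # masters answering correctly
        elif m == 1 and x == 0:
            b += 1  # masters answering incorrectly
        elif m == 0 and x == 1:
            c += 1  # non-masters answering correctly
        else:
            d += 1  # non-masters answering incorrectly
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return None  # a marginal total is zero, so phi is undefined
    return (a * d - b * c) / denominator


# Hypothetical data: four masters and four non-masters.
item = [1, 1, 1, 0, 1, 0, 0, 0]
status = [1, 1, 1, 1, 0, 0, 0, 0]
phi = phi_coefficient(item, status)
print(phi)
```
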

This phi ($\phi$) index is a minor modification of classical item-total
score correlations to fit the idea of criterion-referenced tests. The
usefulness of the index deserves careful examination. First, we will
discuss the limitations of this index. Then, a comparison of the point
biserial correlation with the phi ($\phi$) coefficient and the difference
in proportions ($D_p$) of correct responses in mastery and non-mastery
groups will be presented.
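
For reference, the classical item-total index that $\phi$ modifies can be
sketched as below. The sketch uses the standard point biserial formula
with the population standard deviation of the total scores; the function
name and data are hypothetical, and returning `None` when the item or the
total score has no variance is a choice made for the sketch.

```python
import math
from typing import List, Optional


def point_biserial(item_scores: List[int], total_scores: List[float]) -> Optional[float]:
    """Point biserial correlation between a 0/1 item and the test score."""
    n = len(item_scores)
    p = sum(item_scores) / n  # proportion answering the item correctly
    if p == 0 or p == 1:
        return None  # everyone right or everyone wrong: no item variance
    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if sd == 0:
        return None  # all test scores identical
    right = [t for x, t in zip(item_scores, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
    mean_right = sum(right) / len(right)
    mean_wrong = sum(wrong) / len(wrong)
    return (mean_right - mean_wrong) / sd * math.sqrt(p * (1 - p))


# Hypothetical data: one item and the subtest scores of eight subjects.
item = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [5, 4, 5, 4, 2, 3, 1, 0]
r_pbi = point_biserial(item, scores)
print(round(r_pbi, 3))
```

Unlike $\phi$, this index uses the full test score rather than the
mastery decision, so it keeps information that the dichotomy discards.
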

The $\phi$ coefficient is ambiguous and cannot be solved when: (a) the
item is answered correctly or incorrectly by everyone, or (b) all
subjects are declared as mastery or non-mastery. In these situations, the
difference in proportions between mastery and non-mastery groups ($D_p$)
or the point biserial correlation coefficient ($r_{pbi}$) probably makes
more sense.

Empirical comparisons of $r_{pbi}$, $\phi$, and $D_p$ were designed to
investigate:¹

1) the relationship among $r_{pbi}$, $\phi$, and $D_p$ within a sample
when (a) the sample consists of subjects with a wide variety of abilities
and (b) the sample consists of subjects with homogeneous ability skewed
to one side;

2) the consistency of $r_{pbi}$, $\phi$, and $D_p$ from one sample to
another when (a) the samples have similar test score distributions, and
(b) the samples have different test score distributions; and

3) the relationship among $r_{pbi}$, $\phi$, and $D_p$ when items vary in
difficulty.



Two similar studies were performed. In the first study, the pre- and
posttests of IPI math, D-subtraction,² were used. The pre- and posttests
are equivalent but not identical. These tests were developed from the
objectives presented in Table 1, five items for each objective. Data of
IPI students taking these tests were obtained and compared to those of
non-IPI students who took the same pre- and posttests. Descriptive
statistics of the results are shown in Table 2. In the pretest, the score
distributions for IPI students and non-IPI students are not too far
apart. However, since the posttest for the IPI group was given after
instruction, the scores of the IPI students on the posttest are far more
homogeneous and clustered toward the right-hand side. For each item,
$r_{pbi}$, $\phi$, and $D_p$ were computed in each sample. Then, the
intercorrelations of these indices were calculated by the Pearson
product-moment correlation. These data are presented in Tables 3 and 4.

¹ The author is indebted to Miss Betty Boston for her assistance in
administering the tests.

² IPI Mathematics, Developmental Edition, Appleton-Century-Crofts, 1967.
One of the objectives in D-subtraction was not used because it is a timed
test.

In the second study, another two forms of the D-subtraction test for the
same objectives were constructed, with four items for each objective.
Form A was administered to students in grades 3 and 4. Form B was
administered to students in grades 2 and 3 in different schools. As shown
in Table 5, the variations for the two groups that took Form A are not
substantial. The scores of the second grade group on Form B, Table 6, are
highly skewed to the left. The intercorrelations of $r_{pbi}$, $\phi$,
and $D_p$ for these two tests are given in Tables 7 and 8.
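
The intercorrelation step used in both studies can be sketched as
follows: each index is computed once per item, and the resulting lists
are then correlated pairwise across items with the Pearson product-moment
formula. The index values below are invented for illustration and are not
taken from the studies; the function and variable names are likewise
hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


# Invented index values for five items in one sample (one value per item).
indices: Dict[str, List[float]] = {
    "r_pbi": [0.60, 0.45, 0.30, 0.70, 0.20],
    "phi": [0.55, 0.40, 0.35, 0.65, 0.10],
    "D_p": [0.50, 0.42, 0.30, 0.62, 0.15],
}

# Correlate every pair of indices across the items.
intercorrelations = {
    (first, second): pearson(indices[first], indices[second])
    for first, second in combinations(indices, 2)
}
for pair, r in intercorrelations.items():
    print(pair, round(r, 2))
```

With two samples, the same function can correlate one index in the first
sample with the same index in the second, which is how consistency across
samples is judged.
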

From Tables 3, 4, 7, and 8, one may observe that when the sample consists
of subjects with a wide variety of abilities and with a more symmetrical
distribution of test scores, $r_{pbi}$, $\phi$, and $D_p$ are all highly
correlated with each other. When the sample consists of homogeneous
subjects and is skewed to either left or right, the correlations between
$r_{pbi}$ and $\phi$ and between $r_{pbi}$ and $D_p$, though all
significant at the 5 per cent level, are considerably lower. This trend
is shown in the IPI group, Table 4, and again in Grade 2, Table 8.

The consistency of $r_{pbi}$, $\phi$, and $D_p$ from one sample to
another can be judged from Tables 3, 4, 7, and 8 too. The correlations
between two samples for $r_{pbi}$, $\phi$, and $D_p$ are all very high in
Tables 3 and 7 and rather low in Tables 4 and 8. Evidently, when samples
have similar test score distributions, $r_{pbi}$, $\phi$, and $D_p$ are
all relatively consistent from one sample to another. However, when
samples have differently shaped score distributions, these discrimination
statistics cannot be generalized from one sample to another. Thus, a
highly discriminating item for a group with a wide variety of abilities
is not necessarily a highly discriminating item for a selected group.
Therefore, an identical item may not measure the same type of performance
in two different groups. In other words, if we try out test items on a
group which has a wide variety of abilities in order to apply these
discrimination indices, and use this information to select highly
discriminating items for a second, highly selected group, such as the IPI
group in Table 4, we may not be choosing the appropriate items.

What is the effect of item difficulty on the correlations among
$r_{pbi}$, $\phi$, and $D_p$? Items were grouped into three categories
according to the index of difficulty: high (.7 and higher), middle
(.4-.7), and low (below .4). Four samples of correlations for these three
indices were obtained. Table 9 shows that, as one may expect, the
correlations are consistently higher when items are of middle difficulty.
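
The grouping of items by difficulty can be sketched as below. Two points
are assumptions of the sketch: the index of difficulty is taken to be the
proportion of subjects answering the item correctly, and the middle
category is taken to run from .4 up to, but not including, .7. The
function names and item values are hypothetical.

```python
from typing import Dict, List


def difficulty_category(p: float) -> str:
    """Assign an item to a difficulty category from its index p.

    Assumption: p is the proportion of subjects answering correctly,
    with categories high (p >= .7), middle (.4 <= p < .7), low (p < .4).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("the difficulty index must lie between 0 and 1")
    if p >= 0.7:
        return "high"
    if p >= 0.4:
        return "middle"
    return "low"


def group_items(difficulty: Dict[str, float]) -> Dict[str, List[str]]:
    """Collect item labels under their difficulty category."""
    groups: Dict[str, List[str]] = {"high": [], "middle": [], "low": []}
    for label, p in difficulty.items():
        groups[difficulty_category(p)].append(label)
    return groups


# Invented difficulty indices for five items.
items = {"item1": 0.85, "item2": 0.70, "item3": 0.55, "item4": 0.40, "item5": 0.25}
groups = group_items(items)
print(groups)
```

The discrimination indices are then intercorrelated separately within
each of the three groups of items.
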

These empirical data show that $\phi$ and $D_p$ are consistent with
$r_{pbi}$ in most cases. To use $\phi$ or $D_p$ in the selection of items
for norm-referenced tests may not be justifiable because they tend to
lose some information. For criterion-referenced tests, if our definition
of good criterion-referenced test items is reasonable, then $\phi$ and
$D_p$ are simple ways of detecting poorly discriminating items. It should
be emphasized that any good index is useful only to the extent that it
helps in differentiating items according to certain characteristics. It
should not be used as the only basis for item elimination. Ultimately,
human judgment is required to decide whether an item should be revised or
eliminated by considering the statistical properties of items and test
scores in light of the purpose and the nature of the test.



Reliability of Criterion-Referenced Tests

It is generally known that the reliability coefficients computed from the
traditional reliability formulas are affected by the heterogeneity of
test scores. Since criterion-referenced tests are not designed to produce
variability in test scores, they obviously cannot avoid the problem of
the homogeneity of test scores. Also, too much emphasis on the
homogeneity of items is not desirable either, because it may reduce the
validity of the test.
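
The dependence of a traditional coefficient on score variability can be
illustrated with KR20, the coefficient reported in the tables of this
paper. The sketch below uses the standard KR20 formula with the
population variance of the total scores; the function name and both data
sets are invented for the illustration.

```python
from typing import List, Optional


def kr20(responses: List[List[int]]) -> Optional[float]:
    """Kuder-Richardson formula 20 for a subjects-by-items 0/1 matrix."""
    n_subjects = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_subjects
    variance = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    if variance == 0 or n_items < 2:
        return None  # undefined without score variance or with one item
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_subjects
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / variance)


# Invented group with widely spread scores on a four-item subtest.
spread = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Invented group in which nearly everyone has mastered the objective.
homogeneous = [[1, 1, 1, 1]] * 4 + [[1, 1, 1, 0]]

kr20_spread = kr20(spread)
kr20_homogeneous = kr20(homogeneous)
print(round(kr20_spread, 2), round(abs(kr20_homogeneous), 2))
```

The spread group yields a coefficient of .80, while the homogeneous group
yields a coefficient of essentially zero on the same kind of items, which
is the difficulty this section is concerned with.
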

To apply classical reliability formulas to a criterion-referenced test
while disregarding the different behavioral objectives within the test is
evidently undesirable. Items for different objectives should be treated
separately. If we can reconcile the homogeneity of items within an
objective with the heterogeneity across the objectives, the homogeneity
of items is not necessarily a bad property for criterion-referenced
measurement. In other words, if we can increase traditional reliability
for items within one objective but decrease the homogeneity across the
objectives, we may be able to increase both reliability and validity at
the same time. However, this concept needs further exploration and
empirical investigation.

To deal with the problem of homogeneous subjects, Livingston (1970)
suggested an alternate type of reliability coefficient. He used the
cut-off score that defines mastery, instead of the mean, to redefine
deviation for criterion-referenced testing. The reliability coefficient
computed by this method is at least as large as the norm-referenced
reliability. The greater the difference between the mean and the cut-off
score, the greater this reliability coefficient will be. Thus, changing
the cut-off score will change this reliability coefficient.
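
Both properties can be read off the form in which Livingston's
coefficient is usually written. The formula is not printed in this paper,
so the expression below, and the symbols $r_{xx}$ (the norm-referenced
reliability), $\sigma^2$ (the score variance), $\mu$ (the mean), and $C$
(the cut-off score), are assumptions brought in for illustration.

```latex
r_c^2 = \frac{r_{xx}\,\sigma^2 + (\mu - C)^2}{\sigma^2 + (\mu - C)^2},
\qquad
r_c^2 - r_{xx} = \frac{(1 - r_{xx})\,(\mu - C)^2}{\sigma^2 + (\mu - C)^2} \;\ge\; 0 .
```

Because $r_{xx} \le 1$, the difference on the right is never negative,
and it grows toward $1 - r_{xx}$ as $(\mu - C)^2$ becomes large relative
to $\sigma^2$. When the mean equals the cut-off score, the two
coefficients coincide.
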

Let us refer back to Table 6. Table 6 shows two groups with differences
in score distributions. The reliability coefficients computed from KR20
become very low as the items become very difficult for the second grade
group. Using a criterion of 75 per cent, the corresponding Livingston
coefficient ($r_c^2$) is considerably higher. However, one may still
question whether a test is going to be more reliable if the difference
between the mean and the cut-off score is maximized.
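
A numerical check against Table 6 can be sketched as follows. The paper
does not print Livingston's formula, so the commonly cited form used in
the code is an assumption, as are the function name and the reading of
the 75 per cent criterion as a cut-off of 3 on a four-item subtest. The
input values are those reported for Objective I in Grade 2.

```python
def livingston_coefficient(
    reliability: float,
    mean: float,
    sd: float,
    cutoff: float,
) -> float:
    """Criterion-referenced reliability in the commonly cited form
    (r * var + (mean - cutoff)^2) / (var + (mean - cutoff)^2).

    Assumption: this is the form behind the coefficients in Table 6;
    the paper itself does not print the formula.
    """
    variance = sd ** 2
    squared_distance = (mean - cutoff) ** 2
    if variance + squared_distance == 0:
        raise ValueError("undefined when sd is 0 and the mean equals the cut-off")
    return (reliability * variance + squared_distance) / (variance + squared_distance)


# Objective I, Grade 2 (Table 6): mean 2.49, standard deviation 1.49,
# KR20 .80, cut-off 3 items out of 4 (75 per cent).
r_c2 = livingston_coefficient(reliability=0.80, mean=2.49, sd=1.49, cutoff=3.0)
print(round(r_c2, 2))

# Moving the cut-off away from the mean raises the coefficient
# even though the responses themselves are unchanged.
r_c2_far = livingston_coefficient(reliability=0.80, mean=2.49, sd=1.49, cutoff=4.0)
print(round(r_c2_far, 2))
```

The first value rounds to .82, the figure given in Table 6 for this
objective. The second call shows the property questioned in the text: the
coefficient rises simply because the cut-off was moved farther from the
mean.
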

Summary 

Criterion-referenced measurement represents an attempt to measure and to
interpret human behaviors more meaningfully. In view of recent
developments in individualized instruction and instructional technology,
the traditional approach of comparing a student’s performance with that
of his peers is not enough. This is especially true if the test results
are going to be used to make a decision about further instruction. To be
able to judge what a student can and cannot do, items that yield a better
prediction of what a person can and cannot do should be used and grouped
together for the convenience of interpretation and item analysis. A
mastery level should be determined for each criterion behavior (or
objective) rather than judging a group of objectives as a whole. The
mastery level will not necessarily be the same for every objective in a
test.

A criterion-referenced test designed in this manner can be greatly
facilitated by item analysis procedures and the application of classical
reliability theory. We examined the possibility of using the phi ($\phi$)
coefficient and the difference in proportions of correct responses
between mastery and non-mastery groups ($D_p$) as discrimination indices
for criterion-referenced test items. These two indices were also compared
empirically with the point biserial correlation ($r_{pbi}$) between item
and test score under three different situations. Results showed that
these indices are highly correlated in most cases. However, because of
the inconsistency of these indices from a group with a wide variety of
abilities to a highly selected group, the items selected for one group
may not be measuring the same kind of performance in a second, more
homogeneous group. Therefore, the procedure of trying out test items in a
group with a wide variety of abilities in order to apply these indices in
selecting items for criterion-referenced tests is not recommended.

Table 1

A Sample Structure of a Criterion-Referenced Test

Objective                                     No. of Items   Mastery Level

I.   Does subtraction without borrowing            5              80%
     for numbers with three or more
     digits.

II.  Subtracts with borrowing from tens            5              80%
     place using two-digit numbers.

III. Subtracts with borrowing from tens            5              80%
     or hundreds place with three-digit
     numbers.

IV.  Subtracts with borrowing from tens            5              80%
     and hundreds place with three-digit
     numbers.



The objectives are taken from IPI math continuum, 1968-69 (working copy). 
Learning Research and Development Center, University of Pittsburgh. 

Table 2

Descriptive Statistics of IPI and Non-IPI Subjects
Using the Same Pre- and Posttests of IPI Math,
D-Subtraction

                                    IPI                      Non-IPI
Test   Objective   No. of    Mean   St. Dev.  KR20    Mean   St. Dev.  KR20
                   Items

Pre                         (N=77)                   (N=78)
       I             5      4.45    1.09      .75    4.51    1.05      .77
       II            5      2.79    2.09      .90    1.99    2.33      .97
       III           5      2.42    2.32      .96    1.78    2.27      .97
       IV            5      1.90    2.11      .93    1.36    1.96      .93

Post                        (N=59)                   (N=78)
       I             5      4.69     .73      .59    4.58     .89      .65
       II            5      4.47    1.19      .8?    1.69    2.28      .98
       III           5      4.00    1.36      .72    1.62    2.17      .96
       IV            5      3.64    1.50      .72    1.10    1.77      .91

Table 3

Intercorrelations of $r_{pbi}$, $\phi$, and $D_p$ from IPI and Non-IPI Subjects
Using the Same Pretest of IPI Math,
D-Subtraction (N = 20 items)

[Correlation matrix not recoverable.]

Note: 5% = .444, 1% = .561



Table 4

Intercorrelations of $r_{pbi}$, $\phi$, and $D_p$ from IPI and Non-IPI Subjects
Using the Same Posttest of IPI Math,
D-Subtraction (N = 20 items)

Posttest (IPI):       1. $r_{pbi}$   2. $\phi$   3. $D_p$
Posttest (Non-IPI):   4. $r_{pbi}$   5. $\phi$   6. $D_p$

[Correlation matrix not recoverable.]

Table 5

Descriptive Statistics of Subjects
Taking Test Form A

                         Grade 3 (N=49)              Grade 4 (N=59)
Objective   No. of    Mean   St. Dev.  KR20      Mean   St. Dev.  KR20
            Items
I             4       3.47    .92      .62       3.90    .30      [illegible]
II            4       2.10   1.86      .95       2.78   1.57      .37
III           4       1.92   1.91      .97       2.66   1.70      .92
IV            4       1.33   1.68      .92       1.95   1.78      .91



Table 6

Reliability Coefficients Computed from KR20 and Their Corresponding
Criterion-Referenced Reliability Coefficients for Test Form B

                                     Grade 2 (N=69)                    Grade 3 (N=110)
Objective  No. of  Mastery    Mean  St. Dev.  KR20  $r_c^2$     Mean  St. Dev.  KR20  $r_c^2$
           Items   Level
I            4      75%       2.49   1.49     .80    .82        3.46    .99     .71    .76
II           4      75%        .28    .82     .83    .98        2.75   1.52     .84    .85
III          4      75%        .14    .46     .47    .99        2.53   1.40     .72    .74
IV           4      75%        .07    .31     .39    .99        2.15   1.48     .74    .80

Table 7

Intercorrelations of $r_{pbi}$, $\phi$, and $D_p$ from
Grade 3 and Grade 4 Subjects
Using Test Form A
(N = 16 items)

[Correlation matrix for Grade 3 and Grade 4 not recoverable.]

Note: 5% = .497, 1% = .623



Table 8

Intercorrelations of $r_{pbi}$, $\phi$, and $D_p$ from
Grade 2 and Grade 3 Subjects
Using Test Form B
(N = 16 items)

[Correlation matrix not recoverable.]




Table 9

The Intercorrelations of $r_{pbi}$, $\phi$, and $D_p$
When Items Vary in Difficulty

              $r_{pbi}$ x $\phi$         $\phi$ x $D_p$          $r_{pbi}$ x $D_p$
            High   Middle  Low       High   Middle  Low       High   Middle  Low
Sample 1    .92*   .77*    .12       .83*   .99*    .79*      .76    .79*    .38
Sample 2    .73*   .68*   -.42      -.45    .81*    .82+     -.32    .88*    .14
Sample 3    .93*   .91*    .80*      .69+   .98*    .92*      .70+   .91*    .93*
Sample 4    .46    .93*    .43       .70*   .99*    .98*      .50    .88*    .44




+ - Significant at 
* - Significant at 1% 